Introduction

This document is an exploratory data analysis of the Red Wines dataset. The dataset contains chemical and physical properties of wines, unique IDs, and a quality score assigned by professionals.

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## Observations: 1,599
## Variables: 13
## $ X                    (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity        (dbl) 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity     (dbl) 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid          (dbl) 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar       (dbl) 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides            (dbl) 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide  (dbl) 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide (dbl) 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density              (dbl) 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH                   (dbl) 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates            (dbl) 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol              (dbl) 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality              (int) 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...

There are 1,599 records and 13 variables in the dataset.

Univariate Plots Section

Histograms

Let’s look at histograms of all variables in the dataset (except X).

Non-linear histograms

The total sulfur dioxide distribution looks normal on a log10 scale. I’ll add log10 of total.sulfur.dioxide to the dataset for future use.

Let’s try adding another variable: non-free sulfur dioxide, which is the difference between total and free sulfur dioxide.
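As a minimal sketch of this step, the transformed column can be added with base R. The data frame name `wines` is an assumption; the five input values are the minimum, quartiles and maximum of total.sulfur.dioxide from the summary above, used only as a small stand-in for the full column.

```r
# Small stand-in for the full dataset: min, quartiles and max of
# total.sulfur.dioxide taken from the summary output above.
# The data frame name `wines` is an assumption.
wines <- data.frame(
  total.sulfur.dioxide = c(6, 22, 38, 62, 289)
)

# Add the log10-transformed column, using the column name that appears
# in the correlation output later in this document.
wines$log10.total.sulfur.dioxide <- log10(wines$total.sulfur.dioxide)

print(wines)
```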

Additional variables

The distribution is almost the same as that of total sulfur dioxide.
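The non-free sulfur dioxide variable described above could be computed along these lines. This is a sketch, not necessarily the original code: the data frame name `wines` is an assumption, and only the first four rows shown in the structure listing are used.

```r
# First four rows of the two sulfur dioxide columns, as shown in the
# structure listing in the Introduction. `wines` is an assumed name.
wines <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# Non-free sulfur dioxide is whatever is left after subtracting the free part.
wines$nonfree.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

print(wines)
```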

95th percentile

Because several variables have heavy tails, it is interesting to look at their distributions without these tails.
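One way to drop the tail is to cut each variable at its 95th percentile. The sketch below uses a synthetic log-normal sample in place of a real column, so the numbers are not from the wine data; the names `wines`, `cutoff` and `trimmed` are illustrative.

```r
set.seed(42)  # fixed seed so the synthetic sample is reproducible

# Synthetic right-skewed sample standing in for a heavy-tailed column such
# as residual.sugar. These are NOT values from the wine dataset.
wines <- data.frame(residual.sugar = rlnorm(1000, meanlog = 1, sdlog = 0.5))

# Value below which 95% of the observations fall.
cutoff <- quantile(wines$residual.sugar, probs = 0.95)

# Keep only the rows at or below the cutoff, i.e. drop the top 5%.
trimmed <- subset(wines, residual.sugar <= cutoff)

# Compare the range before and after trimming.
print(nrow(trimmed))
print(c(before = max(wines$residual.sugar), after = max(trimmed$residual.sugar)))
```

The trimmed data frame can then be passed to the same histogram code as the full data, which makes the shape of the main body of the distribution easier to judge.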

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) and an X (id) label.

Most wines have quality 5 or 6 (neutral?). Only a few wines have quality 8, and even fewer have quality 3. There are no wines with quality > 8 or quality < 3.

There are no wines with less than 8.4% alcohol. Most wines have between 9% and 13% alcohol.

Normal distributions:

  • pH
  • density
  • quality

Close to normal distributions, with some outliers in the right tail:

  • sulphates
  • volatile acidity
  • fixed acidity

Right-skewed distributions (most values on the left, with a long right tail):

  • citric acid
  • residual sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • alcohol

What is/are the main feature(s) of interest in your dataset?

The most interesting features in this dataset are wine quality and the basic chemical characteristics (alcohol, pH, acids, sulphates).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think it will be interesting to figure out why good wine is good and why bad wine is bad. Because most wines have average quality (5 or 6), I’m going to look in detail at wines with low quality (3 and 4) and high quality (7 and 8).

Did you create any new variables from existing variables in the dataset?

I added non-free sulfur dioxide, but it does not seem very helpful. I also added total sulfur dioxide on a log10 scale, because it may be interesting for future investigation.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were a number of parameters with heavy tails. Replotting them without the top 5% of values made the underlying distributions easier to see.

Long tails

I drew histograms of total sulfur dioxide, sulphates, chlorides, volatile acidity and residual sugar restricted to values below the 95th percentile. Sulphates, volatile acidity and fixed acidity have distributions close to normal, and removing the largest 5% of values brought them much closer to normal. The same procedure made the chlorides histogram look normal. It would be interesting to take a closer look at these outliers.

Citric acid

It seems that citric acid has a number of zeros. Let’s calculate the number of zeros and the percentage of such records.

## [1] 132  16
## [1] 0.08255159 1.00000000

There are 132 records (8.3% of wines) with citric acid equal to zero. According to this article:

Citric acid is often added to wines to increase acidity, complement a specific flavor or prevent ferric hazes. It can be added to finished wines to increase acidity and give a “fresh” flavor.

So it is interesting to review the dependence between citric acid and wine quality.
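The two output lines in this section have the shape of a `dim()` call on the zero-citric-acid subset and of that result divided by the dimensions of the full data frame; that reading is an assumption. A sketch of such a calculation on the first seven rows from the structure listing (data frame names are illustrative):

```r
# First seven rows of citric.acid and quality from the structure listing
# in the Introduction. `wines` and `zero_citric` are assumed names.
wines <- data.frame(
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06),
  quality     = c(5, 5, 5, 6, 5, 5, 5)
)

# Records with no citric acid at all.
zero_citric <- subset(wines, citric.acid == 0)

# Number of such records (rows) and number of columns.
print(dim(zero_citric))

# Element-wise ratio: the first value is the share of zero-citric-acid
# records, the second is always 1 because the columns are unchanged.
print(dim(zero_citric) / dim(wines))
```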

Log10 scale

I plotted several variables on a log10 scale. After that I found that the total sulfur dioxide distribution looks normal on this scale.

Bivariate Plots Section

Let’s start with correlations between our parameters.

##                                       X fixed.acidity volatile.acidity
## X                           1.000000000   -0.26848392     -0.008815099
## fixed.acidity              -0.268483920    1.00000000     -0.256130895
## volatile.acidity           -0.008815099   -0.25613089      1.000000000
## citric.acid                -0.153551355    0.67170343     -0.552495685
## residual.sugar             -0.031260835    0.11477672      0.001917882
## chlorides                  -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide         0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide       -0.117849669   -0.11318144      0.076470005
## density                    -0.368372087    0.66804729      0.022026232
## pH                          0.136005328   -0.68297819      0.234937294
## sulphates                  -0.125306999    0.18300566     -0.260986685
## alcohol                     0.245122841   -0.06166827     -0.202288027
## quality                     0.066452608    0.12405165     -0.390557780
## log10.total.sulfur.dioxide -0.122541052   -0.11789982      0.073407103
## nonfree.sulfur.dioxide     -0.178263036   -0.07814929      0.097033939
##                             citric.acid residual.sugar    chlorides
## X                          -0.153551355   -0.031260835 -0.119868519
## fixed.acidity               0.671703435    0.114776724  0.093705186
## volatile.acidity           -0.552495685    0.001917882  0.061297772
## citric.acid                 1.000000000    0.143577162  0.203822914
## residual.sugar              0.143577162    1.000000000  0.055609535
## chlorides                   0.203822914    0.055609535  1.000000000
## free.sulfur.dioxide        -0.060978129    0.187048995  0.005562147
## total.sulfur.dioxide        0.035533024    0.203027882  0.047400468
## density                     0.364947175    0.355283371  0.200632327
## pH                         -0.541904145   -0.085652422 -0.265026131
## sulphates                   0.312770044    0.005527121  0.371260481
## alcohol                     0.109903247    0.042075437 -0.221140545
## quality                     0.226372514    0.013731637 -0.128906560
## log10.total.sulfur.dioxide -0.003637462    0.147471411  0.060221933
## nonfree.sulfur.dioxide      0.066776040    0.174529035  0.055479649
##                            free.sulfur.dioxide total.sulfur.dioxide
## X                                  0.090479643          -0.11784967
## fixed.acidity                     -0.153794193          -0.11318144
## volatile.acidity                  -0.010503827           0.07647000
## citric.acid                       -0.060978129           0.03553302
## residual.sugar                     0.187048995           0.20302788
## chlorides                          0.005562147           0.04740047
## free.sulfur.dioxide                1.000000000           0.66766645
## total.sulfur.dioxide               0.667666450           1.00000000
## density                           -0.021945831           0.07126948
## pH                                 0.070377499          -0.06649456
## sulphates                          0.051657572           0.04294684
## alcohol                           -0.069408354          -0.20565394
## quality                           -0.050656057          -0.18510029
## log10.total.sulfur.dioxide         0.713535755           0.92313740
## nonfree.sulfur.dioxide             0.425148917           0.95768634
##                                density          pH    sulphates
## X                          -0.36837209  0.13600533 -0.125306999
## fixed.acidity               0.66804729 -0.68297819  0.183005664
## volatile.acidity            0.02202623  0.23493729 -0.260986685
## citric.acid                 0.36494718 -0.54190414  0.312770044
## residual.sugar              0.35528337 -0.08565242  0.005527121
## chlorides                   0.20063233 -0.26502613  0.371260481
## free.sulfur.dioxide        -0.02194583  0.07037750  0.051657572
## total.sulfur.dioxide        0.07126948 -0.06649456  0.042946836
## density                     1.00000000 -0.34169933  0.148506412
## pH                         -0.34169933  1.00000000 -0.196647602
## sulphates                   0.14850641 -0.19664760  1.000000000
## alcohol                    -0.49617977  0.20563251  0.093594750
## quality                    -0.17491923 -0.05773139  0.251397079
## log10.total.sulfur.dioxide  0.10553948 -0.01483664  0.069754799
## nonfree.sulfur.dioxide      0.09513464 -0.10805328  0.032244043
##                                alcohol     quality
## X                           0.24512284  0.06645261
## fixed.acidity              -0.06166827  0.12405165
## volatile.acidity           -0.20228803 -0.39055778
## citric.acid                 0.10990325  0.22637251
## residual.sugar              0.04207544  0.01373164
## chlorides                  -0.22114054 -0.12890656
## free.sulfur.dioxide        -0.06940835 -0.05065606
## total.sulfur.dioxide       -0.20565394 -0.18510029
## density                    -0.49617977 -0.17491923
## pH                          0.20563251 -0.05773139
## sulphates                   0.09359475  0.25139708
## alcohol                     1.00000000  0.47616632
## quality                     0.47616632  1.00000000
## log10.total.sulfur.dioxide -0.23085802 -0.17014272
## nonfree.sulfur.dioxide     -0.22320257 -0.20546298
##                            log10.total.sulfur.dioxide
## X                                        -0.122541052
## fixed.acidity                            -0.117899816
## volatile.acidity                          0.073407103
## citric.acid                              -0.003637462
## residual.sugar                            0.147471411
## chlorides                                 0.060221933
## free.sulfur.dioxide                       0.713535755
## total.sulfur.dioxide                      0.923137400
## density                                   0.105539483
## pH                                       -0.014836642
## sulphates                                 0.069754799
## alcohol                                  -0.230858016
## quality                                  -0.170142719
## log10.total.sulfur.dioxide                1.000000000
## nonfree.sulfur.dioxide                    0.846502529
##                            nonfree.sulfur.dioxide
## X                                     -0.17826304
## fixed.acidity                         -0.07814929
## volatile.acidity                       0.09703394
## citric.acid                            0.06677604
## residual.sugar                         0.17452903
## chlorides                              0.05547965
## free.sulfur.dioxide                    0.42514892
## total.sulfur.dioxide                   0.95768634
## density                                0.09513464
## pH                                    -0.10805328
## sulphates                              0.03224404
## alcohol                               -0.22320257
## quality                               -0.20546298
## log10.total.sulfur.dioxide             0.84650253
## nonfree.sulfur.dioxide                 1.00000000

Most parameters seem to be uncorrelated. There is a correlation of about 0.67 between fixed.acidity and citric.acid, and almost the same correlation between free.sulfur.dioxide and total.sulfur.dioxide. There is an anticorrelation of about -0.68 between pH and fixed.acidity. The highest correlation of quality, with alcohol, is just 0.48. So quality does not depend directly on any single physicochemical parameter of the wine.

Let’s take a look at the dependencies mentioned above.

The first two graphs didn’t show anything new; the correlation is clearly visible. As mentioned in Wikipedia:
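A correlation table like the one above can be produced with `cor()` on a numeric data frame. The sketch below uses only three columns and the first seven rows from the structure listing, so its coefficients will differ from the full-data values; `wines` and `corr` are assumed names.

```r
# Three columns, first seven rows, taken from the structure listing in
# the Introduction. Far too few rows to reproduce the table above.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30)
)

# Pearson correlation between every pair of columns.
corr <- cor(wines)

# Rounding makes the matrix easier to scan.
print(round(corr, 2))
```

Even on this tiny sample the signs agree with the full table: fixed.acidity is positively related to citric.acid and negatively related to pH.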

pH is defined as the decimal logarithm of the reciprocal of the hydrogen ion activity

So pH should correlate better with log10 of fixed.acidity.

## [1] -0.7063602

Now the anticorrelation is even stronger: -0.706

Although the correlation is weak, the points outside the main group show that sweet wines have less alcohol and, conversely, wines with more alcohol have less sugar, because sugar is converted to alcohol during winemaking. There is one outlier: a wine with 14.9% alcohol and about 8 g/dm^3 of residual sugar. Perhaps some additional alcohol was added to this wine during production.

Because quality is the most interesting feature, let’s take a look at the quality plots.
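The comparison between the raw and the log-transformed correlation can be sketched as follows. Only the first seven rows from the structure listing are used, so the printed coefficients are illustrative and will not match the full-data value of -0.706; the variable names are assumptions.

```r
# First seven rows of fixed.acidity and pH from the structure listing.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30)
)

# Correlation of pH with the raw acid concentration.
r_raw <- cor(wines$fixed.acidity, wines$pH)

# Correlation of pH with log10 of the concentration. pH is itself a
# logarithmic quantity, so this relationship is expected to be more linear.
r_log <- cor(log10(wines$fixed.acidity), wines$pH)

print(c(raw = r_raw, log10 = r_log))
```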

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality has a low correlation with the other parameters. It has some correlation with alcohol, which is fun, but this correlation is not strong. The boxplot of quality vs citric acid shows that wines containing more citric acid tend to be rated higher, but wines with extreme amounts of this acid have low quality. There is also an obvious dependence of quality on sulphates (the best wines have a larger median value) and an anticorrelation with volatile.acidity, which is low for the best wines.
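The per-quality medians that a boxplot displays can also be computed directly. This sketch uses the first seven rows from the structure listing, which contain only quality 5 and 6, so it illustrates the mechanics rather than the finding; `wines` and `median_by_quality` are assumed names.

```r
# First seven rows of sulphates and quality from the structure listing.
wines <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46),
  quality   = c(5, 5, 5, 6, 5, 5, 5)
)

# Median sulphates within each quality level; the result is named by level.
median_by_quality <- tapply(wines$sulphates, wines$quality, median)

print(median_by_quality)
```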

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found a good negative relationship between the logarithm of fixed acidity and pH, although this is to be expected from the pH formula.

What was the strongest relationship you found?

There is a correlation of about 0.67 between fixed.acidity and citric.acid, and almost the same correlation between free.sulfur.dioxide and total.sulfur.dioxide. Neither is very strong. Another observation is that the best wines have a slightly higher average alcohol content. I didn’t mention the correlations between nonfree.sulfur.dioxide, log10.total.sulfur.dioxide and total.sulfur.dioxide, because the first two features were built from the last one. As mentioned earlier, the strongest correlation is between log10 of fixed.acidity and pH: -0.706

Multivariate Plots Section

Let’s start with scatter plots of the features that have a visible influence on quality: volatile.acidity, sulphates, alcohol and citric.acid.

Now let’s look at the same plots, but only for good and bad wines, where quality < 5 or quality > 6.

## [1] 280  17

There are 280 items in the dataset once the average wines are excluded.
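Excluding the average wines amounts to a single row filter. The sketch below uses made-up quality scores and alcohol values purely to show the condition; `wines` and `extreme_wines` are assumed names.

```r
# Made-up example rows (NOT from the wine dataset) covering the full
# range of observed quality scores.
wines <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.5, 9.4, 9.8, 10.5, 10.0, 11.5, 12.0)
)

# Keep only the wines rated below 5 (bad) or above 6 (good).
extreme_wines <- subset(wines, quality < 5 | quality > 6)

# Rows and columns of the filtered data.
print(dim(extreme_wines))
```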

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When I plotted quality against the 4 previously selected features (volatile.acidity, sulphates, alcohol and citric.acid), I found that the dependencies are clear on the multivariate graphs. Because my main interest was the dependence on quality, I filtered out wines of quality 5 and 6 to see clearly which wines are good and which are bad. The last set of plots shows clearly visible groups of parameter values for good and bad wines.

Were there any interesting or surprising interactions between features?

It is interesting that the amounts of different acids (citric, acetic and tartaric) determine the taste and quality of wine. For example, the best wines tend to have 0.25-0.5 g/dm^3 of citric acid but less than 0.4 g/dm^3 of volatile acidity (acetic acid). Increasing citric acid will not make a wine bad, but a lot of acetic acid easily will.


Final Plots and Summary

Plot One

Description One

Distribution of wine quality. It is very close to normal. No wine received a mark below 3 or above 8. Why are ratings 0-2 and 9-10 absent? A single expert may well rate a wine 1 or 9, but these ratings are medians of at least 3 evaluations made by wine experts, and the chance that all experts give the same very high or very low score is really small. To me, this shows that the wine quality metric depends heavily on personal taste, so it will be hard to find strong correlations or build an analytical model.

Plot Two

Description Two

This plot shows 4 characteristics of the selected wines (quality 4 and less, or 7 and more). It’s clear that good wines usually have lower volatile acidity and a higher amount of sulphates. Another noticeable point is that bad wines have less alcohol.

Plot Three

Description Three

All 4 of these parameters can provide some information about the quality of wine. So, to be among the best, a wine should:

  • have a not very high density.
  • have volatile acidity less than 0.4 g/dm^3.
  • have a fairly high amount of alcohol (> 11%).
  • have a moderate amount of sulphates, around 0.75 g/dm^3.

It’s not so easy to build a model of a bad wine, though, because the parameters vary more widely for bad wines.


Reflection

The dataset contains information about 1599 wine samples, together with quality scores assigned by experts. The data is complete, without visible errors or mistakes. Although there are not many correlations between the characteristics, and most of them are not directly related to quality, it’s possible to find some differences between good and bad wines, while there is not enough information to say anything important about average wines.

I think it will be hard to build an analytical model to predict wine quality, because the parameters are not linearly separable and there is a large component of personal taste in the quality rating. Still, it is possible to get some intuition about the quality of a wine from the amounts of acids, sulphates and alcohol in each sample.

It was interesting to realize that fixed acidity (tartaric acid, the main component of wine acidity) on a log10 scale has a good anticorrelation with pH, which is a log10 characteristic too.

I never thought of acids as an important part of wine taste. I wonder what would happen if I added some lemon juice (which contains 6-8% citric acid) to a red wine with low acidity. That will be my next experiment.